Abstract. Algebraic statistics is a recently evolving field in which statistical models are
treated as algebraic objects, so that tools from computational commutative algebra and
algebraic geometry can be used in their analysis and computation. In this approach,
calculating the parameters of a statistical model amounts to solving a set of polynomial
equations in several variables, for which one can use the celebrated Gröbner bases theory.
Owing to the important role of information theory in statistics, this paper, as a first step,
explores the possibility of describing maximum and minimum entropy (ME) models in the
framework of algebraic statistics. We show that ME-models are toric models (a class of
algebraic statistical models) when the constraint functions (which provide the information
about the underlying random variable) are integer valued, and that the set of statistical
models that results from ME-methods is indeed an affine variety.
1. Introduction

Algebra has always played an important role in statistics, a classical example being
linear algebra. There are also many other instances of applying algebraic tools in
statistics (e.g., (Viana & Richards 2001)). However, treating statistical models as algebraic
objects, and thereby using tools of computational commutative algebra and algebraic
geometry in the analysis of statistical models, is very recent and has led to the still
evolving field of algebraic statistics.

The use of computational algebra and algebraic geometry in statistics was initiated
in the work of Diaconis and Sturmfels (Diaconis & Sturmfels 1998) on exact hypothesis
tests of conditional independence in contingency tables, and in the work of Pistone et
al. (Pistone et al. 2001) in experimental design. The term 'Algebraic Statistics' was first
coined in the monograph by Pistone et al. (Pistone et al. 2001) and appeared recently
in the title of the book by Pachter and Sturmfels (Pachter & Sturmfels 2005).

To extract the underlying algebraic structures in discrete statistical models,
algebraic statistics treats statistical models as affine varieties. (An affine variety is the
set of all solutions to a family of polynomial equations.) Parametric statistical models are
described in terms of a polynomial (or rational) mapping from a set of parameters to
distributions. One can show that many statistical models, for example independence
models, Bernoulli random variables etc. (see (Pachter & Sturmfels 2005) for more
examples), can be given this algebraic formulation, and these are referred to as algebraic
statistical models.

Exponential models, which form an important class of statistical models, are
studied in algebraic statistics under the name 'toric' models by using maximum
likelihood methods. Toric models are algebraic statistical models, and the term 'toric'
comes from important algebraic objects known as 'toric ideals' in computational algebra.
In view of the well-established role of information theory in statistics (Kullback 1959,
Csiszár & Shields 2004), this paper attempts to describe maximum entropy models in
the algebraic statistical framework.

In particular, we show that maximum entropy models (and also minimum relative-entropy
models) are indeed toric models when the functions that provide the
information about the underlying random variable in the form of expected values are
integer valued. We also show that when the information is available in the form of sample
means, a modification of the maximum entropy prescription reduces the calculation of model
parameters to solving a set of polynomial equations. This establishes that the set of
statistical models that results from maximum entropy methods is indeed an algebraic variety.

A note on the results presented in this paper: owing to space constraints, we do not
present the details of Gröbner bases theory and the related concepts used to solve
polynomial equations; we refer the reader to textbooks on computational algebra and Gröbner
bases theory (Adams & Loustaunau 1994, Cox et al. 1991).

We organize our paper as follows. In § 2 we give basic notions of algebra and
introduce notation along with an introduction to algebraic statistics. § 3 describes
maximum entropy (ME) prescriptions in the algebraic statistical framework by introducing
important algebraic objects called toric ideals. In § 4 we show how one can transform
the problem of calculating ME distributions into solving a set of polynomial equations.

2. Algebraic Statistical Models 

2.1. Basic notions of Algebra 

Throughout this paper $k$ represents a field. A monomial in $n$ indeterminates $x_1, \dots, x_n$
is a power product of the form $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, where all the exponents are nonnegative
integers, i.e., $\alpha_i \in \mathbb{Z}_{\ge 0}$, $i = 1, \dots, n$. One can simplify the notation for monomials as
follows: denote $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{Z}^n_{\ge 0}$ and, using multi-index notation, set

$$x^{\alpha} = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$$

with the understanding that $x = (x_1, \dots, x_n)$. Note that $x^{\alpha} = 1$ whenever
$\alpha = (0, \dots, 0)$. Once the order of the indeterminates is fixed, the monomial
$x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ is identified by $(\alpha_1, \dots, \alpha_n)$. Hence, the set of all
monomials in the indeterminates $x_1, \dots, x_n$ can be represented by $\mathbb{Z}^n_{\ge 0}$. The theory of
monomials is central to the celebrated Gröbner bases theory in computational algebra, which
provides tools for solving sets of polynomial equations and related problems in algebraic
geometry (Mishra & Yap 1989). Monomial theory itself plays an important role in algebraic
statistics in the representation of exponential models, where probabilities are expressed in
terms of power products (Rapallo 2006).

A polynomial $f$ in $x_1, \dots, x_n$ with coefficients in $k$ is a finite linear combination of
monomials and can be written in the form

$$f = \sum_{\alpha \in \Lambda_f} a_{\alpha} x^{\alpha},$$

where $\Lambda_f \subset \mathbb{Z}^n_{\ge 0}$ is a finite set and $a_{\alpha} \in k$. The collection of all polynomials in the
indeterminates $x_1, \dots, x_n$ is the set $k[x_1, \dots, x_n]$, and it has the structure not only of a
vector space but also of a ring. Indeed, the ring structure of $k[x_1, \dots, x_n]$ plays the main
role in computational algebra and algebraic geometry.

A subset $\mathfrak{a} \subset k[x_1, \dots, x_n]$ is said to be an ideal if it satisfies: (i) $0 \in \mathfrak{a}$;
(ii) if $f, g \in \mathfrak{a}$, then $f + g \in \mathfrak{a}$; (iii) if $f \in \mathfrak{a}$ and $h \in k[x_1, \dots, x_n]$, then
$hf \in \mathfrak{a}$. A set $V \subset k^n$ is said to be an affine variety if there exist
$f_1, \dots, f_s \in k[x_1, \dots, x_n]$ such that

$$V = \{(c_1, \dots, c_n) \in k^n : f_i(c_1, \dots, c_n) = 0,\ 1 \le i \le s\}.$$

We use the notation $V(f_1, \dots, f_s) = V$.

2.2. Algebraic Statistical Model

At the very core of the field of algebraic statistics lies the notion of an 'algebraic
statistical model'. While this notion has the potential of serving as a unifying theme for
algebraic statistics, there is no unified definition of an algebraic statistical model (Drton
& Sullivant 2006). Here, we adopt the appropriate definition of a statistical model
from (Pachter & Sturmfels 2005, Drton & Sullivant 2006). For a recent, detailed
discussion of the formal definition of algebraic statistical models one can refer to (Drton &
Sullivant 2006).

Let $X$ be a discrete random variable taking finitely many values from the set
$[m] = \{1, 2, \dots, m\}$. A probability distribution $p$ of $X$ is naturally represented as a
vector $p = (p_1, \dots, p_m) \in \mathbb{R}^m$ if we fix the order on $[m]$. The set of all probability mass
functions (pmfs) of $X$ is called the probability simplex

$$\Delta_{m-1} = \Big\{ p = (p_1, \dots, p_m) \in \mathbb{R}^m_{\ge 0} : \sum_{i=1}^{m} p_i = 1 \Big\}. \tag{1}$$

The index $m - 1$ indicates the dimension of the simplex $\Delta_{m-1}$. A statistical model $\mathcal{M}$
is a subset of $\Delta_{m-1}$ and is said to be algebraic if there exist $f_1, \dots, f_s \in k[p_1, \dots, p_m]$ such that

$$\mathcal{M} = V(f_1, \dots, f_s) \cap \Delta_{m-1}.$$
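As a small illustration of this definition (a sketch that is not taken from the source): for two binary random variables the joint pmf $p = (p_{11}, p_{12}, p_{21}, p_{22})$ lies in $\Delta_3$, and the independence model is cut out by the single polynomial $f = p_{11}p_{22} - p_{12}p_{21}$. The function names below are hypothetical, and exact rational arithmetic is assumed so that the test $f(p) = 0$ is meaningful.

```python
from fractions import Fraction


def in_simplex(p):
    """True if p has nonnegative entries summing to one."""
    return all(x >= 0 for x in p) and sum(p) == 1


def independence_poly(p):
    """f(p) = p11*p22 - p12*p21 for p = (p11, p12, p21, p22)."""
    p11, p12, p21, p22 = p
    return p11 * p22 - p12 * p21


def in_independence_model(p):
    """Membership test for V(f) intersected with the simplex."""
    return in_simplex(p) and independence_poly(p) == 0


# Product of the marginals (1/3, 2/3) and (1/4, 3/4): lies on the variety.
a, b = Fraction(1, 3), Fraction(1, 4)
p_indep = (a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b))

# A perfectly correlated table: in the simplex but off the variety.
p_dep = (Fraction(1, 2), Fraction(0), Fraction(0), Fraction(1, 2))
```

Here `in_independence_model(p_indep)` holds while `in_independence_model(p_dep)` does not, although both vectors are valid pmfs.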

Now we move on to parametric statistical models and their algebraic formulations.

Let $\Theta \subset \mathbb{R}^d$ be a parameter space and $\kappa : \Theta \to \Delta_{m-1}$ be a map. The image
$\kappa(\Theta)$ is called a parametric statistical model. Given a statistical model $\mathcal{M} \subset \Delta_{m-1}$, by a
parametrization of $\mathcal{M}$ we mean identifying a set $\Theta \subset \mathbb{R}^d$ and a function $\kappa : \Theta \to \Delta_{m-1}$
such that $\mathcal{M} = \kappa(\Theta)$. To describe more general statistical models in the algebraic framework
we need the following notion of a semi-algebraic set.

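Definition 2.1. A set $\Theta \subset \mathbb{R}^d$ is called a semi-algebraic set if there are two finite
collections of polynomials $F \subset k[x_1, \dots, x_d]$ and $G \subset k[x_1, \dots, x_d]$ such that

$$\Theta = \{\theta \in \mathbb{R}^d : f(\theta) = 0\ \ \forall f \in F \ \text{and}\ g(\theta) > 0\ \ \forall g \in G\}.$$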

Now we have the following definition of a parametric algebraic statistical model.

Definition 2.2. Let $\Delta_{m-1}$ be a probability simplex and $\Theta \subset \mathbb{R}^d$ be a semi-algebraic
set. Let $\kappa : \mathbb{R}^d \to \mathbb{R}^m$ be a rational function (a rational function is a quotient of two
polynomials) such that $\kappa(\Theta) \subset \Delta_{m-1}$. Then the image $\mathcal{M} = \kappa(\Theta)$ is a parametric
algebraic statistical model.

Conversely, a parametric statistical model $\mathcal{M} = \kappa(\Theta) \subset \Delta_{m-1}$ is said to be algebraic
if $\Theta$ is a semi-algebraic set and $\kappa$ is a rational function. From now on we refer to
'parametric algebraic statistical models' as 'algebraic statistical models'.

In this paper we consider the following special case of algebraic statistical models
(cf. (Pachter & Sturmfels 2005, p. 7)). Consider a map

$$\kappa : \Theta\ (\subset \mathbb{R}^d) \to \mathbb{R}^m, \qquad \kappa : \theta = (\theta_1, \dots, \theta_d) \mapsto (\kappa_1(\theta), \dots, \kappa_m(\theta)), \tag{2}$$

where $\kappa_i \in k[\theta_1, \dots, \theta_d]$. We assume that $\Theta$ satisfies $\kappa_i(\theta) > 0$, $i = 1, \dots, m$, and
$\sum_{i=1}^{m} \kappa_i(\theta) = 1$ for any $\theta \in \Theta$. Under these conditions $\kappa(\Theta)$ is indeed an algebraic
statistical model (Definition 2.2), since $\kappa(\Theta) \subset \Delta_{m-1}$, $\kappa$ is a polynomial function and
$\Theta$ is a semi-algebraic set ($F = \{\sum_{i=1}^{m} f_i - 1\}$ and $G = \{f_i : i = 1, \dots, m\}$ with $f_i = \kappa_i$ in
Definition 2.1).

Some statistical models are naturally given by a polynomial map $\kappa$ as in (2) for which
the condition $\sum_{i=1}^{m} \kappa_i(\theta) = 1$ does not hold. If this is the case one can consider the following
algebraic statistical model:

$$\kappa : \theta = (\theta_1, \dots, \theta_d) \mapsto \frac{1}{\sum_{i=1}^{m} \kappa_i(\theta)} \big(\kappa_1(\theta), \dots, \kappa_m(\theta)\big), \tag{3}$$

assuming that the remaining conditions specified for the model (2) are valid
here too. The only difference is that instead of $\kappa$ being a polynomial map, it is a
rational map.
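To make the passage from a polynomial map to the rational map (3) concrete, here is a minimal sketch under assumptions of our own: the polynomial map $(\theta_1^2,\ 2\theta_1\theta_2,\ \theta_2^2)$ is a hypothetical example, not one used in the text, and its coordinates sum to $(\theta_1 + \theta_2)^2$ rather than to 1.

```python
from fractions import Fraction


def kappa(theta):
    """Unnormalized polynomial map (hypothetical example): (t1^2, 2*t1*t2, t2^2)."""
    t1, t2 = theta
    return (t1 * t1, 2 * t1 * t2, t2 * t2)


def kappa_normalized(theta):
    """Rational map of the form (3): divide each coordinate by the coordinate sum."""
    values = kappa(theta)
    total = sum(values)
    return tuple(v / total for v in values)


theta = (Fraction(1), Fraction(2))
p = kappa_normalized(theta)  # (1/9, 4/9, 4/9)
```

The coordinates of `kappa(theta)` sum to 9 here, so the polynomial map alone does not land in the simplex, whereas the normalized image does.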



3. ME in algebraic statistical setup 

3.1. Toric Models 

In the algebraic description of exponential models, monomials and binomials play a
fundamental role. The study of relations among power products leads to the theory of
toric ideals in commutative algebra (Sturmfels 1996). Here we describe the basic
notions of toric ideals that are relevant to the representation and computation of discrete
exponential models; for more details on the theory and computation of toric ideals one can
refer to (Sturmfels 1996, Bigatti & Robbiano 2001, Bigatti et al. 1999).

Before we give the definition of a toric ideal, we describe the notion of a Laurent
polynomial. If we allow negative exponents in a polynomial, i.e., a polynomial of the
form $f = \sum_{\alpha \in \Lambda_f} a_{\alpha} x^{\alpha}$ where $\alpha \in \mathbb{Z}^n$, it is known as a Laurent polynomial ($\Lambda_f \subset \mathbb{Z}^n$ is
finite). The set of all Laurent polynomials in the indeterminates $x_1, \dots, x_n$ is denoted by
$k[x_1^{\pm 1}, \dots, x_n^{\pm 1}]$, and it also has the structure of a ring.

Now we define the toric ideal. 

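Definition 3.1. Let $A = [a_{ij}] \in \mathbb{Z}^{d \times n}$ be a matrix with rank $d$. Consider the ring
homomorphism

$$\hat{\pi} : k[x_1, \dots, x_n] \to k[\theta_1^{\pm 1}, \dots, \theta_d^{\pm 1}], \qquad \hat{\pi} : x_j \mapsto \theta_1^{a_{1j}} \cdots \theta_d^{a_{dj}}. \tag{4}$$

The toric ideal $\mathfrak{a}_A$ of $A$ is defined as the kernel of the map $\hat{\pi}$, i.e., $\mathfrak{a}_A = \ker \hat{\pi}$.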

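The mapping $\hat{\pi}$ can be viewed as a "parametrization", which can be explained
by the following description of $\pi$. Consider the map

$$\pi : \mathbb{Z}^n_{\ge 0} \to \mathbb{Z}^d, \qquad \pi : u = (u_1, \dots, u_n) \mapsto Au. \tag{5}$$

The map $\pi$ lifts to the ring homomorphism $\hat{\pi}$ in the sense of the action of $\hat{\pi}$ on
$x^u = x_1^{u_1} \cdots x_n^{u_n} \in k[x_1, \dots, x_n]$. That is,

$$\hat{\pi}(x^u) = \hat{\pi}(x_1^{u_1} \cdots x_n^{u_n}) = \Big(\prod_{i=1}^{d} \theta_i^{a_{i1}}\Big)^{u_1} \cdots \Big(\prod_{i=1}^{d} \theta_i^{a_{in}}\Big)^{u_n} \tag{6}$$

$$= \prod_{i=1}^{d} \theta_i^{\sum_{j=1}^{n} a_{ij} u_j}. \tag{7}$$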
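The relation between the exponent map (5) and the kernel of the lifted ring homomorphism can be sketched in a few lines. The sketch relies on the standard fact that a binomial $x^u - x^v$ lies in the toric ideal exactly when $Au = Av$; the matrix and the function names are hypothetical illustrations, not objects from the text.

```python
def pi(A, u):
    """Exponent map (5): u -> A u, with A given as a list of d rows of length n."""
    return tuple(sum(a_ij * u_j for a_ij, u_j in zip(row, u)) for row in A)


def binomial_in_toric_ideal(A, u, v):
    """x^u - x^v lies in the kernel of the lifted map exactly when A u == A v."""
    return pi(A, u) == pi(A, v)


# Hypothetical 2 x 3 matrix of rank 2. Its columns (2,0), (1,1), (0,2) are the
# theta-exponents of the images of x_1, x_2, x_3 under the ring homomorphism.
A = [[2, 1, 0],
     [0, 1, 2]]

# x_1 * x_3 and x_2^2 both map to theta_1^2 * theta_2^2.
u = (1, 0, 1)
v = (0, 2, 0)
```

With this matrix, `binomial_in_toric_ideal(A, u, v)` holds, so $x_1 x_3 - x_2^2$ belongs to the toric ideal, while $x_1 - x_2$ does not.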

Toric ideal theory plays an important role in applications of computational algebraic
geometry such as integer programming (cf. (Sturmfels 1996)). Note that in the
algebraic descriptions of exponential models and their maximum likelihood estimates
only the nonnegative case of toric ideals (and hence toric models) is considered, i.e., the
matrix $A = [a_{ij}]$ in Definition 3.1 is assumed to be nonnegative and the map (4)
is specified as $\hat{\pi} : k[x_1, \dots, x_n] \to k[\theta_1, \dots, \theta_d]$ (see (Pachter & Sturmfels 2005)).
As described later in this paper, in the algebraic descriptions of maximum entropy
models one has to deal with Laurent polynomials, and hence one has to include the
negative case in the definitions of toric ideals and toric models. This poses no problem,
because toric ideal theory in commutative algebra naturally includes the negative case
(as in Definition 3.1) and Gröbner bases theory can be extended to the Laurent polynomial
ring (Pauer & Unterkircher 1999).

The concept of toric ideals led to the description of exponential models under the
name toric models in algebraic statistics, which are defined as follows.

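Definition 3.2. Let $A \in \mathbb{Z}^{d \times m}_{\ge 0}$ be a matrix such that the vector $(1, \dots, 1) \in \mathbb{Z}^m_{\ge 0}$ is in
the row span of $A$. Let $h \in \mathbb{R}^m_{> 0}$ be a vector of positive real numbers. Let $\Theta = \mathbb{R}^d_{> 0}$ and
let $\kappa^{A,h}$ be the rational parametrization

$$\kappa^{A,h} : \Theta \to \mathbb{R}^m, \qquad \kappa_j^{A,h} : \theta \mapsto Z(\theta)^{-1} h_j \prod_{i=1}^{d} \theta_i^{a_{ij}}, \quad j = 1, \dots, m, \tag{8}$$

where $\theta = (\theta_1, \dots, \theta_d)$ and $Z(\theta)$ is the appropriate normalizing constant. The toric
model is the parametric algebraic statistical model

$$\mathcal{M}_{A,h} = \kappa^{A,h}(\Theta). \tag{9}$$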

Independence models, exponential models, Markov chains and hidden Markov
chains can be given an algebraic statistical description by means of toric models (Pachter
& Sturmfels 2005). We keep the nonnegativity of $A$ in Definition 3.2 as a matter of
convention.
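The toric parametrization can be evaluated numerically as follows. This is a minimal sketch assuming the componentwise form $p_j = Z(\theta)^{-1} h_j \prod_i \theta_i^{a_{ij}}$; the function name `toric_model` and the particular $A$, $h$ and $\theta$ are hypothetical choices made for illustration.

```python
import math


def toric_model(A, h, theta):
    """Evaluate p_j = h_j * prod_i theta_i^{a_ij} / Z(theta) for j = 1..m."""
    d, m = len(A), len(A[0])
    weights = [
        h[j] * math.prod(theta[i] ** A[i][j] for i in range(d))
        for j in range(m)
    ]
    Z = sum(weights)  # normalizing constant Z(theta)
    return [w / Z for w in weights]


# Hypothetical example: (1, 1, 1) is half the sum of the two rows of A,
# so it lies in the row span as the definition requires.
A = [[2, 1, 0],
     [0, 1, 2]]
h = [1.0, 2.0, 1.0]
p = toric_model(A, h, [1.0, 3.0])
```

For this choice the unnormalized weights are 1, 6 and 9, so the result is the binomial distribution with two trials and success probability 3/4.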



3.2. ME in terms of Toric Models 

Let $X$ be a random variable taking values from the set $[m] = \{1, 2, \dots, m\}$. The only
information we know about the pmf $p = (p_1, \dots, p_m)$ of $X$ is in the form of expected
values of the functions $t_i : [m] \to \mathbb{R}$, $i = 1, \dots, d$ (we refer to these functions as 'constraint
functions'). We therefore have

$$\sum_{j=1}^{m} t_i(j)\, p_j = T_i, \quad i = 1, \dots, d, \tag{10}$$

where $T_i$, $i = 1, \dots, d$, are assumed to be known. In an information theoretic approach
to statistics, known as Jaynes' maximum entropy model, one would choose the pmf
$p \in \Delta_{m-1}$ that maximizes the Shannon entropy functional

$$S(p) = -\sum_{j=1}^{m} p_j \ln p_j \tag{11}$$

with respect to the constraints (10).

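The corresponding Lagrangian can be written as

$$\Xi(p, \xi) = S(p) - \xi_0 \Big(\sum_{j=1}^{m} p_j - 1\Big) - \sum_{i=1}^{d} \xi_i \Big(\sum_{j=1}^{m} t_i(j)\, p_j - T_i\Big). \tag{12}$$

Holding $\xi = (\xi_1, \dots, \xi_d)$ fixed, the unconstrained maximum of the Lagrangian $\Xi(p, \xi)$ over
all $p \in \Delta_{m-1}$ is given by an exponential family (Cover & Thomas 1991)

$$p_j(\xi) = Z(\xi)^{-1} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \dots, m, \tag{13}$$

where $Z(\xi)$ is the normalizing constant given by

$$Z(\xi) = \sum_{j=1}^{m} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big). \tag{14}$$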

For various values of $\xi \in \mathbb{R}^d$, the family (13) is known as the maximum entropy model.
Now, we have the following proposition.

Proposition 3.3. The maximum entropy model (13) is a toric model provided that the
constraint functions are integer valued.

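Proof. Set $\ln \theta_i = \xi_i$, $i = 1, \dots, d$. Now, (13) gives us

$$p_j = Z(\theta)^{-1} \exp\Big(-\sum_{i=1}^{d} t_i(j) \ln \theta_i\Big) = Z(\theta)^{-1} \prod_{i=1}^{d} \theta_i^{-t_i(j)}. \tag{15}$$

By defining the matrix $A = [a_{ij}] \in \mathbb{Z}^{d \times m}$ with $a_{ij} = -t_i(j)$ and setting $h = (\frac{1}{m}, \dots, \frac{1}{m})$ we have a rational
parametrization as in (8). □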
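The change of variables used in the proof can be checked numerically. The sketch below assumes the substitution $\ln \theta_i = \xi_i$, so that the exponential weights $\exp(-\sum_i \xi_i t_i(j))$ become the Laurent monomials $\prod_i \theta_i^{-t_i(j)}$; the constraint functions and multiplier values are hypothetical.

```python
import math


def me_exponential(t, xi):
    """Exponential form: p_j proportional to exp(-sum_i xi_i * t_i(j))."""
    d, m = len(t), len(t[0])
    w = [math.exp(-sum(xi[i] * t[i][j] for i in range(d))) for j in range(m)]
    Z = sum(w)
    return [x / Z for x in w]


def me_monomial(t, theta):
    """Monomial form: p_j proportional to prod_i theta_i^(-t_i(j))."""
    d, m = len(t), len(t[0])
    w = [math.prod(theta[i] ** (-t[i][j]) for i in range(d)) for j in range(m)]
    Z = sum(w)
    return [x / Z for x in w]


# Hypothetical integer-valued constraint functions t_1, t_2 on [4].
t = [[0, 1, 2, 3],
     [1, 0, 1, 0]]
xi = [0.3, -0.7]
theta = [math.exp(x) for x in xi]  # ln(theta_i) = xi_i

p_exp = me_exponential(t, xi)
p_mon = me_monomial(t, theta)
```

The two vectors agree up to floating-point rounding. The integrality of the $t_i(j)$ is what makes the second form a Laurent monomial in $\theta$.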

Note that we allowed only integer valued functions in the ME-model in the above
proposition, which is necessary for the algebraic description of the model. We also
mention that in the above proof, by assuming $h \in \Delta_{m-1}$ (which acts as a prior), we can
conclude that the minimum I-divergence model (Csiszár 1975)

$$p_j = Z(\xi)^{-1} h_j \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \dots, m, \tag{16}$$

(with the appropriate normalizing constant $Z(\xi)$) is indeed a toric model.

Once the statistical model is specified, the task is to calculate the model
parameters from the available information. In this case the available information is in
the form of expected values of the functions $t_i$, $i = 1, \dots, d$, and the Lagrange parameters $\xi_i$,
$i = 1, \dots, d$, are determined using the constraints (10).
4. Calculation of ME distributions via solving Polynomial equations

4.1. Direct method

One can show that the Lagrange parameters in the ME-model (13) can be estimated by
solving the following set of partial differential equations (Jaynes 1968)

$$-\frac{\partial}{\partial \xi_i} \ln Z(\xi) = T_i, \quad i = 1, \dots, d, \tag{17}$$

which has no explicit analytical solution. In the literature there are several methods
of estimating ME-models. One of the important methods is Darroch and Ratcliff's
generalized iterative scaling algorithm (Darroch & Ratcliff 1972). Here we show
that ME-models can be calculated using computational algebraic methods.

Note that the set of all distributions which satisfy (10) is known as a linear family (we
denote this by $\mathcal{L}$). Now, if we represent the exponential family (13) by $\mathcal{E}$, the set of
statistical models that results from the ME-principle can be written as $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$. One
can show that $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$ is a variety.

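By substituting the maximum entropy distributions (15) in (10) we get

$$\sum_{j=1}^{m} t_i(j)\, Z(\theta)^{-1} \prod_{k=1}^{d} \theta_k^{-t_k(j)} = T_i, \quad i = 1, \dots, d, \tag{18}$$

which can be written as

$$\sum_{j=1}^{m} t_i(j) \prod_{k=1}^{d} \theta_k^{-t_k(j)} = T_i \sum_{j=1}^{m} \prod_{k=1}^{d} \theta_k^{-t_k(j)}, \quad i = 1, \dots, d. \tag{19}$$

The solutions of the system of polynomial equations (19) give the maximum entropy model
specified by the available information (10). We state this as a proposition.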

Proposition 4.1. The maximum entropy model (13) can be specified by solving a set of
polynomial equations provided that the constraint functions $t_i$, $i = 1, \dots, d$, are integer
valued.
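A worked instance may help to see what the polynomial system looks like. The sketch below assumes the smallest nontrivial case, $m = 3$, $d = 1$ and $t = (0, 1, 2)$, chosen only for illustration. With $p_j \propto \theta^{-t(j)}$ the constraint reads $\theta^{-1} + 2\theta^{-2} = T(1 + \theta^{-1} + \theta^{-2})$, and multiplying by $\theta^2$ gives the quadratic $T\theta^2 + (T - 1)\theta + (T - 2) = 0$. Systems in several variables would call for Gröbner bases methods, which this sketch does not attempt.

```python
import math


def solve_me_direct(T):
    """Solve T*theta^2 + (T - 1)*theta + (T - 2) = 0 for the positive root.

    This is the constraint for t = (0, 1, 2), d = 1, after clearing the
    negative powers of theta. Valid for 0 < T < 2, where the product of
    the roots is negative, so exactly one root is positive.
    """
    a, b, c = T, T - 1.0, T - 2.0
    disc = b * b - 4.0 * a * c
    theta = (-b + math.sqrt(disc)) / (2.0 * a)
    weights = [theta ** (-k) for k in (0, 1, 2)]  # p_j proportional to theta^(-t(j))
    Z = sum(weights)
    return theta, [w / Z for w in weights]


theta, p = solve_me_direct(0.5)
mean = sum(k * pk for k, pk in zip((0, 1, 2), p))  # reproduces T = 0.5
```

For $T = 1$ the positive root is $\theta = 1$, which recovers the uniform distribution, as expected for a mean at the midpoint of the support.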

4.2. Dual Method 

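Here we follow the method of the dual optimization problem. By using the Kuhn-Tucker
theorem we calculate the Lagrange parameters $\xi_i$, $i = 1, \dots, d$, in (13) by optimizing the dual of
$\Xi(p, \xi)$. That is, the task is to find the $\xi$ which optimizes

$$\Psi(\xi) = \Xi(p(\xi), \xi). \tag{20}$$

Since $\Psi$ is convex in $\xi$, the optimum is a minimum and is characterized by the vanishing of
the gradient. Note that $\Psi(\xi)$ is nothing but the entropy of the ME-distribution (13). We have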

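$$\Psi(\xi) = \ln Z + \sum_{i=1}^{d} \xi_i T_i. \tag{21}$$

This can be written as

$$\Psi(\xi) = \ln \sum_{j=1}^{m} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big) + \sum_{i=1}^{d} \xi_i T_i = \ln \sum_{j=1}^{m} \exp\Big(\sum_{i=1}^{d} \xi_i \big(T_i - t_i(j)\big)\Big). \tag{22}$$

Since the logarithm is monotone, optimizing $\Psi(\xi)$ is equivalent to optimizing

$$\Psi'(\xi) = \sum_{j=1}^{m} \exp\Big(\sum_{i=1}^{d} \xi_i \big(T_i - t_i(j)\big)\Big). \tag{23}$$

By introducing $\xi_i = \ln \theta_i$, $i = 1, \dots, d$, we have

$$\Psi'(\theta) = \sum_{j=1}^{m} \prod_{i=1}^{d} \theta_i^{T_i - t_i(j)}. \tag{24}$$

The solution is given by solving the following set of equations

$$\frac{\partial \Psi'}{\partial \theta_i} = 0, \quad i = 1, \dots, d. \tag{25}$$

Unfortunately, $\Psi' \in k[\theta_1^{\pm 1}, \dots, \theta_d^{\pm 1}]$ only if $T_i \in \mathbb{Z}$. Now, we consider the case where the
expected values are available as sample means.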

In most practical problems the information in the form of expected values is
available via sample or empirical means. That is, given a sequence of observations
$O_1, \dots, O_N$, the sample means $\tilde{T}_i$, $i = 1, \dots, d$, with respect to the functions $t_i$,
$i = 1, \dots, d$, are given by

$$\tilde{T}_i = \frac{1}{N} \sum_{l=1}^{N} t_i(O_l), \quad i = 1, \dots, d, \tag{26}$$

and the underlying hypothesis is $T_i \approx \tilde{T}_i$. That is,

$$\sum_{j=1}^{m} t_i(j)\, p_j \approx \frac{1}{N} \sum_{l=1}^{N} t_i(O_l), \quad i = 1, \dots, d. \tag{27}$$

Now we show that, by choosing an alternate Lagrangian in place of (12), we can
transform the parameter estimation of the ME-model into a problem of solving a set of
(Laurent) polynomial equations.

Proposition 4.2. Given the hypothesis (27), the problem of estimating the ME-model
in the dual method amounts to solving a set of Laurent polynomial equations (assuming
that the constraint functions are integer valued).

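Proof. To retain integer valued exponents in our final solution we consider
constraints of the form

$$N \sum_{j=1}^{m} t_i(j)\, p_j = \sigma_i, \quad i = 1, \dots, d, \tag{28}$$

where $\sigma_i = \sum_{l=1}^{N} t_i(O_l)$ denotes the sample sum. In this case the Lagrangian is

$$\Xi(p, \xi) = S(p) - \xi_0 \Big(\sum_{j=1}^{m} p_j - 1\Big) - \sum_{i=1}^{d} \xi_i \Big(N \sum_{j=1}^{m} t_i(j)\, p_j - \sigma_i\Big). \tag{29}$$

This results in the ME-distribution

$$p_j(\xi) = Z(\xi)^{-1} \exp\Big(-N \sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \dots, m, \tag{30}$$

where $Z(\xi)$ is the normalizing constant given by

$$Z(\xi) = \sum_{j=1}^{m} \exp\Big(-N \sum_{i=1}^{d} \xi_i t_i(j)\Big). \tag{31}$$

To calculate the parameters we optimize the dual of $\Xi(p, \xi)$. That is, we seek the
stationary point of the functional

$$\Psi(\xi) = \ln Z + \sum_{i=1}^{d} \xi_i \sigma_i. \tag{32}$$

It is equivalent to optimizing the functional

$$\Psi'(\xi) = \sum_{j=1}^{m} \exp\Big(\sum_{i=1}^{d} \xi_i \sigma_i - N \sum_{i=1}^{d} \xi_i t_i(j)\Big).$$

By setting $\ln \theta_i = \xi_i$ we have

$$\Psi'(\theta) = \sum_{j=1}^{m} \prod_{i=1}^{d} \theta_i^{\sigma_i - N t_i(j)}. \tag{33}$$

The solution is given by solving the following set of equations

$$\frac{\partial \Psi'}{\partial \theta_i} = 0, \quad i = 1, \dots, d. \tag{34}$$

Since the exponents $\sigma_i - N t_i(j)$ are integers, we have

$$\frac{\partial \Psi'}{\partial \theta_i} \in k[\theta_1^{\pm 1}, \dots, \theta_d^{\pm 1}], \quad i = 1, \dots, d. \tag{35}$$

□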
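The dual construction with sample sums can be traced on a toy data set. The following sketch makes several assumptions of its own: a single constraint function ($d = 1$), observations encoded as values in $\{1, \dots, m\}$, and plain bisection as a stand-in for the computational algebra tools the text has in mind. It finds the positive root of $\theta\, \partial \Psi' / \partial \theta = \sum_j (\sigma - N t(j))\, \theta^{\sigma - N t(j)}$, which has the same positive roots as the stationarity equation and is increasing in $\theta$.

```python
def solve_dual(t, observations, lo=1e-3, hi=1e3, iters=200):
    """Positive root of the d = 1 stationarity equation, by bisection.

    t            : integer constraint values t(1), ..., t(m)
    observations : sample values in {1, ..., m}
    Returns (theta, p) with p_j proportional to theta^(-N * t(j)).
    """
    N = len(observations)
    sigma = sum(t[o - 1] for o in observations)  # sample sum
    exponents = [sigma - N * tj for tj in t]     # integers by construction

    def h(theta):
        # theta * dPsi'/dtheta; same sign and same positive roots as dPsi'/dtheta
        return sum(e * theta ** e for e in exponents)

    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since theta > 0
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    theta = (lo * hi) ** 0.5

    weights = [theta ** (-N * tj) for tj in t]
    Z = sum(weights)
    return theta, [w / Z for w in weights]


# Toy data: t = (0, 1, 2) on {1, 2, 3}; four observations with sample sum 3.
t = (0, 1, 2)
observations = [1, 1, 2, 3]
theta, p = solve_dual(t, observations)
model_sum = len(observations) * sum(tj * pj for tj, pj in zip(t, p))
```

For this data the exponents are $(3, -1, -5)$, the stationarity equation reduces to $3\theta^8 - \theta^4 - 5 = 0$, and `model_sum` reproduces the sample sum 3 up to rounding.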

In algebraic statistics, algebraic descriptions are used to analyze the maximum
likelihood estimates of exponential models (Pachter & Sturmfels 2005). Given
that maximum likelihood and maximum entropy are related, it will be interesting to
compare these two methods from the algebraic statistical point of view.
5. Conclusion and Directions for Future research

In this paper we attempted to describe maximum (and hence minimum) entropy models
in the algebraic statistical framework. We showed that maximum entropy models are toric
models when the constraint functions are assumed to be integer valued, and that
the set of statistical models that results from the ME-principle is indeed a variety. In the dual
estimation we demonstrated that when the information is in the form of empirical means,
the calculation of ME-models can be transformed into solving a set of Laurent polynomial
equations. Work on computational algebraic algorithms for estimating ME-models is
in progress. We hope that this will also shed light on possible interesting algebraic
structures in information theoretic statistics.